In this paper, we address the problem of lip-voice synchronisation in videos containing human faces and voices. Our approach determines whether the lip motion and the voice in a video are synchronised based on their audio-visual correspondence score. We propose an audio-visual cross-modal transformer-based model that outperforms several baseline models on the audio-visual synchronisation task on the standard lip-reading speech benchmark dataset LRS2. While existing methods focus mainly on lip synchronisation in speech videos, we also consider the special case of the singing voice. The singing voice is a more challenging use case for synchronisation because of its sustained vowel sounds. We also investigate how relevant lip synchronisation models trained on speech datasets are in the context of the singing voice. Finally, we use the frozen visual features learned by our lip synchronisation model in the singing voice separation task, where they outperform a baseline audio-visual model trained end to end. Demos, source code, and pre-trained models are available at https://ipcv.github.io/vocalist/.
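The correspondence-score idea can be sketched independently of any particular model: embed both streams per frame, score alignment by cosine similarity, and pick the temporal offset that maximises the mean score. This is a minimal illustration with hypothetical embeddings; the paper's model scores correspondence with a cross-modal transformer, not raw cosine similarity.

```python
import numpy as np

def sync_score(audio_emb, visual_emb):
    """Cosine similarity between one audio and one visual embedding:
    higher means the two streams are more likely in sync at this frame."""
    a = audio_emb / np.linalg.norm(audio_emb)
    v = visual_emb / np.linalg.norm(visual_emb)
    return float(a @ v)

def best_offset(audio_embs, visual_embs, max_shift=5):
    """Slide the audio stream against the visual stream and return the
    shift (in frames) that maximises the mean correspondence score."""
    n = len(visual_embs)
    scores = {}
    for shift in range(-max_shift, max_shift + 1):
        pairs = [(audio_embs[i + shift], visual_embs[i])
                 for i in range(n) if 0 <= i + shift < len(audio_embs)]
        scores[shift] = np.mean([sync_score(a, v) for a, v in pairs])
    return max(scores, key=scores.get)
```

With synthetic embeddings where the audio stream lags the video by two frames, `best_offset` recovers that lag.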
This paper presents an audio-visual approach to voice separation that produces state-of-the-art results at low latency in two scenarios: speech and singing voice. The model is based on a two-stage network. Motion cues are obtained with a lightweight graph convolutional network that processes face landmarks. The audio and motion features are then fed into an audio-visual transformer, which produces a fairly good estimate of the isolated target source. In a second stage, the predominant voice is enhanced with an audio-only network. We present different ablation studies and a comparison with state-of-the-art methods. Finally, we explore the transferability of models trained for speech separation to the task of singing voice separation. Demos, code, and weights are available at https://ipcv.github.io/vovit/
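The two-stage pipeline can be caricatured as two masking passes over a mixture spectrogram: an audio-visual mask first isolates the target, then an audio-only mask enhances the predominant voice. All names here are hypothetical; the paper's stages are learned networks, not fixed masks.

```python
import numpy as np

def apply_mask(mixture_spec, mask):
    """Masking-based separation: multiply the mixture spectrogram by a
    soft mask in [0, 1] predicted for the target source."""
    return np.clip(mask, 0.0, 1.0) * mixture_spec

def two_stage(mixture_spec, av_mask, audio_mask):
    """Stage one: an audio-visual mask gives a rough target estimate.
    Stage two: an audio-only mask refines the predominant voice."""
    rough = apply_mask(mixture_spec, av_mask)
    return apply_mask(rough, audio_mask)
```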
Language models demonstrate both quantitative improvements and new qualitative capabilities as they scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterised. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 442 authors across 132 institutions. Task topics are diverse, drawing from linguistics, childhood development, mathematics, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks believed to be beyond the capabilities of current language models. We evaluate the behaviour of OpenAI's GPT models, Google's internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorisation component, whereas tasks that exhibit "breakthrough" behaviour at a critical scale often involve multiple steps or components, or brittle metrics; and social bias typically increases with scale in settings with ambiguous context, though this can be improved with prompting.
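Calibration, one of the quantities reported above, measures how well a model's stated confidence matches its actual accuracy. A minimal sketch of the standard expected calibration error (ECE), using hypothetical per-example confidences, may make the metric concrete; BIG-bench's exact calibration metric may differ.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected Calibration Error: bin predictions by confidence and
    average the |accuracy - confidence| gap, weighted by bin size."""
    conf = np.asarray(confidences, float)
    corr = np.asarray(correct, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # left-closed first bin, half-open elsewhere
        mask = ((conf >= lo) if lo == 0.0 else (conf > lo)) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(corr[mask].mean() - conf[mask].mean())
    return float(ece)
```

A perfectly calibrated batch (75% confidence, 3 of 4 correct) has zero ECE; a fully miscalibrated one scores the full confidence gap.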
Despite recent advances in 3D face reconstruction from occluded and noisy face images, performance is still unsatisfactory. One of the main challenges is handling moderate to heavy occlusions in the face images. In addition, noise in face images inhibits the correct capture of facial attributes and thus needs to be reliably addressed. Moreover, most existing methods rely on additional dependencies, posing numerous constraints on the training procedure. We therefore propose a Self-Supervised RObustifying GUidancE (ROGUE) framework to obtain robustness against occlusions and noise in face images. The proposed network contains 1) a guidance pipeline that obtains the 3D face coefficients of clean faces, and 2) a robustification pipeline that enforces consistency between the coefficients estimated for occluded or noisy images and those of their clean counterparts. The proposed image-level and feature-level loss functions aid the ROGUE learning process without posing additional dependencies. On three variations of the CelebA test dataset (rationally occluded, delusionally occluded, and noisy face images), our method outperforms the current state of the art (e.g., the 3D shape-based vertex error drops from 0.146 to 0.048 for rational occlusions, from 0.292 to 0.061 for delusional occlusions, and from 0.269 to 0.053 for noise in face images), demonstrating the effectiveness of the proposed approach.
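The robustification idea of tying the coefficients predicted for a corrupted face to those of its clean counterpart can be sketched as a simple feature-level consistency loss. This is an illustrative stand-in; the paper's actual image- and feature-level losses are more involved.

```python
import numpy as np

def consistency_loss(clean_coeffs, corrupted_coeffs):
    """Mean-squared error between the 3D face coefficients predicted for
    a clean face and for its occluded/noisy counterpart.  Minimising
    this pulls the corrupted prediction toward the clean one."""
    clean = np.asarray(clean_coeffs, dtype=float)
    corrupted = np.asarray(corrupted_coeffs, dtype=float)
    return float(np.mean((clean - corrupted) ** 2))
```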
Pre-training models on ImageNet or other large-scale image data has led to major advances in computer vision, albeit accompanied by shortcomings related to curation cost, privacy, usage rights, and ethical issues. In this paper, for the first time, we investigate the transferability of models pre-trained on synthetic data generated by graphics simulators to downstream tasks from very different domains. When pre-training with such synthetic data, we find that downstream performance on different tasks is favored by different configurations of simulation parameters (e.g., lighting, object pose, backgrounds, etc.), and that there is no one-size-fits-all solution. It is therefore better to tailor synthetic pre-training data to a specific downstream task for best performance. We introduce Task2Sim, a unified model that maps downstream task representations to optimal simulation parameters for generating synthetic pre-training data for them. Task2Sim learns this mapping by training to find the optimal parameter set on a set of "seen" tasks. Once trained, it can then predict optimal simulation parameters for novel "unseen" tasks with no additional training. Given a budget on the number of images per class, our extensive experiments with 20 diverse downstream tasks show that Task2Sim's task-adaptive pre-training data leads to significantly better downstream performance than non-adaptively chosen simulation parameters, on both seen and unseen tasks. It is even competitive with pre-training on real images.
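The task-representation-to-parameters mapping can be caricatured with a nearest-neighbour stand-in: given a new task's embedding, reuse the best-known simulation parameters of the most similar "seen" task. All names and the parameter dictionaries are hypothetical; the real Task2Sim learns this mapping end to end.

```python
import numpy as np

def predict_sim_params(task_embedding, seen_embeddings, seen_best_params):
    """Stand-in for Task2Sim's mapping: return the best-known simulation
    parameters of the most similar 'seen' task, by nearest neighbour in
    task-embedding space."""
    dists = np.linalg.norm(seen_embeddings - task_embedding, axis=1)
    return seen_best_params[int(np.argmin(dists))]
```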
Numerical simulation of Earth's weather and climate requires substantial amounts of computation. This has led to a growing interest in replacing subroutines that are expensive to simulate with approximate machine learning (ML) methods that are fast at inference time. Within weather and climate models, atmospheric radiative transfer (RT) calculations are especially expensive. This has made them a popular target for neural-network-based emulators. However, prior work is hard to compare due to the lack of a comprehensive dataset and of standardized best practices for ML benchmarking. To fill this gap, we build ClimART, a large dataset with more than 10 million samples from present, pre-industrial, and future climate conditions, based on the Canadian Earth System Model. ClimART poses several methodological challenges for the ML community, such as multiple out-of-distribution test sets, the underlying domain physics, and a trade-off between accuracy and inference speed. We also present several novel baselines that indicate shortcomings of the datasets and network architectures used in prior work. Download instructions, baselines, and code are available at: https://github.com/rolnicklab/climart
Designing experiments often requires balancing between learning about the true treatment effects and earning from allocating more samples to the superior treatment. While optimal algorithms for the Multi-Armed Bandit Problem (MABP) provide allocation policies that optimally balance learning and earning, they tend to be computationally expensive. The Gittins Index (GI) is a solution to the MABP that can simultaneously attain the goals of optimality and computational efficiency, and it has recently been used in experiments with Bernoulli and Gaussian rewards. For the first time, we present a modification of the GI rule that can be used in experiments with exponentially-distributed rewards. We report its performance in simulated 2-armed and 3-armed experiments. Compared to traditional non-adaptive designs, our novel GI-modified design shows operating characteristics comparable in learning (e.g., statistical power) but substantially better in earning (e.g., direct benefits). This illustrates the potential of designs that use a GI approach to allocate participants to improve participant benefits, increase efficiency, and reduce experimental costs in adaptive multi-armed experiments with exponential rewards.
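The allocation loop such a design runs can be sketched as follows. Computing the Gittins index itself is the hard part, so this sketch substitutes a simple stand-in index (posterior-mean reward under a Gamma prior on the exponential rate, plus a decaying exploration bonus); it illustrates index-based adaptive allocation, not the paper's GI computation.

```python
import numpy as np

def allocate(true_means, horizon, bonus=0.5, draw=None, seed=0):
    """Index-based adaptive allocation with exponentially-distributed
    rewards.  Each round, pull the arm whose index (posterior-mean
    reward + exploration bonus) is highest -- a lightweight stand-in
    for the Gittins index."""
    rng = np.random.default_rng(seed)
    if draw is None:                 # default: exponential rewards
        draw = rng.exponential
    k = len(true_means)
    pulls = np.zeros(k)
    sums = np.zeros(k)
    a0, b0 = 2.0, 1.0                # Gamma(a0, b0) prior on the rate
    for _ in range(horizon):
        # posterior mean of the reward 1/lambda is (b0 + s) / (a0 + n - 1)
        index = (b0 + sums) / (a0 + pulls - 1) + bonus / np.sqrt(pulls + 1)
        arm = int(np.argmax(index))
        pulls[arm] += 1
        sums[arm] += draw(true_means[arm])
    return pulls
```

With a noise-free reward stream (each pull returns the arm's true mean), the rule tries the inferior arm once and then earns from the superior arm for the rest of the horizon.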
Quadruped robots are currently used in industrial robotics as mechanical aids to automate several routine tasks. However, the usage of such a robot in a domestic setting is still very much an open research problem. This paper discusses the understanding and virtual simulation of such a robot capable of detecting and understanding human emotions, generating its gait, and responding via sounds and expressions on a screen. To this end, we use a combination of reinforcement learning and software engineering concepts to simulate a quadruped robot that can understand emotions, navigate through various terrains, detect sound sources, and respond to emotions using audio-visual feedback. This paper aims to establish the framework for simulating a quadruped robot that is emotionally intelligent and can primarily respond to audio-visual stimuli using motor or audio responses. Emotion detection from speech was not as performant as ERANNs or Zeta Policy learning, still managing an accuracy of 63.5%. The video emotion detection system produced results that are almost on par with the state of the art, with an accuracy of 99.66%. Due to its "on-policy" learning process, the PPO algorithm learned extremely rapidly, allowing the simulated dog to demonstrate a remarkably seamless gait across the different cadences and variations. This enabled the quadruped robot to respond to generated stimuli, allowing us to conclude that it functions as predicted and satisfies the aim of this work.
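The "on-policy" PPO objective credited above for the rapid gait learning is standard and compact enough to sketch: the probability ratio between the new and data-collecting policies is clipped so a single update cannot move the policy too far. This shows the clipped surrogate only, not the full training loop.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """PPO clipped surrogate objective.  The ratio pi_new/pi_old is
    clipped to [1 - eps, 1 + eps]; taking the elementwise minimum keeps
    the objective a pessimistic (lower) bound on policy improvement."""
    ratio = np.exp(np.asarray(logp_new, float) - np.asarray(logp_old, float))
    adv = np.asarray(advantages, dtype=float)
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    return float(np.mean(np.minimum(ratio * adv, clipped * adv)))
```

When the policies agree the ratio is 1 and the objective is just the mean advantage; a ratio of 2 with a positive advantage is clipped down to 1.2.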
Real-world robotic grasping can be done robustly if a complete 3D Point Cloud Data (PCD) of an object is available. However, in practice, PCDs are often incomplete when objects are viewed from a few sparse viewpoints before the grasping action, leading to the generation of wrong or inaccurate grasp poses. We propose a novel grasping strategy, named 3DSGrasp, that predicts the missing geometry from the partial PCD to produce reliable grasp poses. Our proposed PCD completion network is a Transformer-based encoder-decoder network with an Offset-Attention layer. Our network is inherently invariant to the object pose and to point permutations, and generates PCDs that are geometrically consistent and properly completed. Experiments on a wide range of partial PCDs show that 3DSGrasp outperforms the best state-of-the-art method on PCD completion tasks and largely improves the grasping success rate in real-world scenarios. The code and dataset will be made available upon acceptance.
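The Offset-Attention idea from point-cloud transformers can be sketched as: compute ordinary self-attention over the point features, then transform the *offset* between the input and its attention aggregation and add it back to the input. This is a simplified single-head sketch with illustrative weight matrices, not the paper's exact layer; the test below checks the permutation property the abstract claims (permuting input points permutes outputs identically).

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def offset_attention(x, w_q, w_k, w_v, w_out):
    """Offset-Attention over point features x of shape (n_points, d):
    self-attention first, then a ReLU transform of the offset
    x - attention(x), added back residually."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    attn = softmax(q @ k.T / np.sqrt(x.shape[1])) @ v
    offset = x - attn
    return x + np.maximum(offset @ w_out, 0.0)   # ReLU transform + residual
```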
When robots learn reward functions using high-capacity models that take raw state directly as input, they need to both learn a representation for what matters in the task -- the task "features" -- as well as how to combine these features into a single objective. If they try to do both at once from input designed to teach the full reward function, it is easy to end up with a representation that contains spurious correlations in the data, which fails to generalize to new settings. Instead, our ultimate goal is to enable robots to identify and isolate the causal features that people actually care about and use when they represent states and behavior. Our idea is that we can tune into this representation by asking users what behaviors they consider similar: behaviors will be similar if the features that matter are similar, even if low-level behavior is different; conversely, behaviors will be different if even one of the features that matter differs. This, in turn, is what enables the robot to disambiguate between what needs to go into the representation versus what is spurious, as well as which aspects of behavior can be compressed together and which cannot. The notion of learning representations based on similarity has a nice parallel in contrastive learning, a self-supervised representation learning technique that maps visually similar data points to similar embeddings, where similarity is defined by a designer through data augmentation heuristics. In contrast, in order to learn the representations that people use, and through them their preferences and objectives, we rely on the users' own definition of similarity. In simulation as well as in a user study, we show that learning through such similarity queries leads to representations that, while far from perfect, are indeed more generalizable than self-supervised and task-input alternatives.
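The similarity-query idea can be sketched as a contrastive (triplet-style) objective: embeddings of two behaviors the user labels "similar" are pulled together, while "different" pairs are pushed at least a margin apart. The function and its margin are illustrative assumptions, not the paper's exact loss.

```python
import numpy as np

def similarity_query_loss(emb_a, emb_b, similar, margin=1.0):
    """Contrastive loss driven by a user's similarity answer: squared
    distance when the user says 'similar', squared hinge on the margin
    when they say 'different'."""
    d = np.linalg.norm(np.asarray(emb_a, float) - np.asarray(emb_b, float))
    if similar:
        return float(d ** 2)
    return float(max(0.0, margin - d) ** 2)
```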